In [1]:
%%javascript
MathJax.Hub.Config({
  TeX: { equationNumbers: { autoNumber: "AMS" } }
});
MathJax.Hub.Queue(
  ["resetEquationNumbers", MathJax.InputJax.TeX],
  ["PreProcess", MathJax.Hub],
  ["Reprocess", MathJax.Hub]
);


Why Radial Basis Function Neural Networks are equivalent to the Nystrom method

Radial Basis Function Neural Networks (RBFN) and the Nystrom method are two methods that can be used for regression. They come from different statistical learning frameworks. RBFN were proposed in the context of shallow neural networks: a three-layer network where the hidden layer computes the RBF similarities to a set of centroids, and a fully connected layer is then added on top to give the final prediction. The Nystrom method was proposed in the kernel machine literature as a way to reduce the computational burden of the full kernel similarity matrix $K$. Nystrom was proposed later; moreover, it can be applied to different kernel approaches, not only regression.

The equivalence of the methods is not something unknown that we came up with; however, we feel the relationship is not as crystal clear as it perhaps seems to experts in the field. In (Quiñonero-Candela 2005) and in Rasmussen's book this relation is taken for granted. Another clear sign of people's awareness of this relationship is the scikit-learn implementation of the Nystrom method, which explicitly uses it. The purpose of this post is therefore to shed light on the proof of this relationship, which helps to better understand these powerful methods. First we show the relation in an empirical risk minimization setting; later we present the Bayesian approach, which leads to the so-called sparse $\mathcal{GP}$ methods. In that context the RBFN network is called the Subset of Regressors (SoR) method. Sparse $\mathcal{GP}$s are an active field of research which has seen many contributions over the last years.


Introduction

Let $\mathcal{D}=\{x_i,y_i\}_{i=1}^N$ be the data of our regression problem: $x_i\in\mathbb{R}^D$, $y_i\in \mathbb{R}$. In matrix notation we write $\mathcal{D}=(X,y)$ where $X$ is an $N\times D$ matrix and $y$ an $N \times 1$ vector. Both methods employ a subset of $\mathcal{D}$ of size $M$, $\mathcal{D}_u = (X_u,y_u)$ ($X_u$ is an $M\times D$ matrix and $y_u$ an $M \times 1$ vector). This subset is sometimes called the centroids, pseudo-inputs or inducing inputs. Also, we will sometimes use the subscript $f$ to refer to the set formed by all the data: $\mathcal{D}=\mathcal{D}_f=(X_f,y_f)=(X,y)$.

As we are working with kernels, we suppose we have a kernel function $k(x_i,x_j)\in \mathbb{R}$ that measures similarities between points $x_i\in\mathbb{R}^D$. We define the similarity between a single point $x_i$ and a set of points $X_u$ as the row vector: $k(x_i,X_u) = K_{x_iu} =[k(x_i,x_{u_1}),k(x_i,x_{u_2}),...,k(x_i,x_{u_M})] \in \mathbb{R}^{1\times M}$. Similarly, we define the matrix generated by the similarities between the sets of points $X_f$ and $X_u$ as: \begin{equation} k(X_f,X_u)=K_{fu} = \begin{pmatrix} k(x_1,x_{u_1}) & k(x_1,x_{u_2}) &...& k(x_1,x_{u_M}) \\ k(x_2,x_{u_1}) & k(x_2,x_{u_2}) &...& k(x_2,x_{u_M}) \\ \vdots & \vdots & \ddots &\vdots \\ k(x_N,x_{u_1}) & k(x_N,x_{u_2}) &...& k(x_N,x_{u_M}) \\ \end{pmatrix} \in \mathbb{R}^{N\times M} \end{equation}
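
As a concrete illustration, here is a minimal sketch of how these matrices can be built with NumPy and scikit-learn, assuming an RBF kernel and inducing inputs chosen at random from the training data (the names X, X_u, K_fu, K_uu and the sizes are our own choices for this sketch):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
N, M, D = 200, 20, 3

X = rng.normal(size=(N, D))                      # X_f: all training inputs
X_u = X[rng.choice(N, size=M, replace=False)]    # inducing inputs / centroids

K_fu = rbf_kernel(X, X_u)                        # N x M matrix of eq. (1)
K_uu = rbf_kernel(X_u, X_u)                      # M x M similarities among inducing points
print(K_fu.shape, K_uu.shape)                    # (200, 20) (20, 20)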

Nystrom method

The Nystrom method proposes substituting the kernel function $k$ with $k_{nystrom}$, defined as $k_{nystrom}(x_i,x_j) = k(x_i,X_u)K_{uu}^{-1}k(X_u,x_j)$. The matrix $K$ ($K_{ff}$) is then replaced with $Q=K_{fu}K_{uu}^{-1}K_{uf}$ ($Q_{ff}$).

The KRR (mean of the GP) solution using the kernel function $k_{nystrom}$ for a given test point $x^*$ is:

$$ \begin{equation} k_{nystrom}(x^*,X_f)\Big(Q_{ff}+ \sigma^2 I \Big)^{-1} y = K_{*u} \overbrace{K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I \Big)^{-1}}^{A} y \end{equation} $$
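
Continuing the sketch above, the Nystrom prediction of eq. (2) can be written in a few lines (the noise level sigma2, the synthetic targets y and the test points X_star are assumptions of this sketch):

# Nystrom prediction of eq. (2): K_{*u} A y with A = K_uu^{-1} K_uf (Q_ff + sigma^2 I)^{-1}
sigma2 = 0.1
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)   # synthetic targets for illustration

K_uf = K_fu.T
Q_ff = K_fu @ np.linalg.solve(K_uu, K_uf)        # Q_ff = K_fu K_uu^{-1} K_uf

X_star = rng.normal(size=(5, D))                 # a few test points
K_su = rbf_kernel(X_star, X_u)                   # K_{*u}

A = np.linalg.solve(K_uu, K_uf) @ np.linalg.inv(Q_ff + sigma2 * np.eye(N))
y_star_nystrom = K_su @ (A @ y)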

RBFN method

The RBFN method is an empirical risk minimization approach originally proposed by Poggio and Girosi (1990, eq. 25). In this approach we first map $X$ to the space generated by the kernel similarities to $X_u$, which for the training data yields the matrix $K_{fu}$ of (1).

Then we solve the linear ridge regression problem in the transformed domain $K_{fu}$. That is, we seek the $\alpha$ that minimizes the empirical risk $J$:

$$ J(\alpha) = \| K_{fu}\alpha - y\|^2 + \sigma^2 \alpha^t K_{uu} \alpha = \alpha^t K_{uf}K_{fu} \alpha - 2 \alpha^t K_{uf} y + y^ty + \sigma^2 \alpha^t K_{uu}\alpha $$

Solving $\nabla_{\alpha} J = 0$ yields: $$ \alpha_{opt} = \Big( K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1} K_{uf}y $$

The prediction for a given $x^*$ is then: \begin{equation} K_{*u}\cdot \alpha_{opt} = K_{*u} \overbrace{\Big( K_{uf}K_{fu}\ + \sigma^2 K_{uu}\Big)^{-1} K_{uf}}^{B} y \end{equation}

We see here the three-layer network: the first layer is the input, the second layer is formed by the kernel similarities to $X_u$, and the third one is the output $y$.
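
The RBFN solution of eq. (3) admits an equally short sketch (reusing the variables defined in the previous snippets):

# RBFN / ridge solution of eq. (3): alpha_opt = (K_uf K_fu + sigma^2 K_uu)^{-1} K_uf y
B = np.linalg.solve(K_uf @ K_fu + sigma2 * K_uu, K_uf)
alpha_opt = B @ y
y_star_rbfn = K_su @ alpha_opt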

Matrix inversion lemma proof

The easiest way to prove the equivalence between both methods (the one used in Rasmussen and Williams, Chapter 8, Section 6.1) is to rely on the matrix inversion lemma, which states:

$$ \Big(Z + UWV^t\Big)^{-1} = Z^{-1} -Z^{-1} U\Big(W^{-1} + V^tZ^{-1}U\Big)^{-1}V^tZ^{-1} $$

Hence we can apply this formula to the inverse appearing in eq. (2), $\Big( \overbrace{K_{fu}K_{uu}^{-1}K_{uf}}^{Q_{ff}}+\sigma^2 I\Big)^{-1}$, to get eq. (3).
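
For completeness, here is a sketch of the intermediate algebra, with our identification of the lemma's terms: $Z=\sigma^2 I$, $U=K_{fu}$, $W=K_{uu}^{-1}$, $V=K_{fu}$:

$$ \begin{aligned} \Big(Q_{ff}+ \sigma^2 I\Big)^{-1} &= \sigma^{-2} I - \sigma^{-4} K_{fu}\Big(K_{uu} + \sigma^{-2}K_{uf}K_{fu}\Big)^{-1}K_{uf} \\ K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I\Big)^{-1} &= \sigma^{-2}K_{uu}^{-1}\Big[ I - K_{uf}K_{fu}\Big(\sigma^2 K_{uu} + K_{uf}K_{fu}\Big)^{-1}\Big]K_{uf} \\ &= \sigma^{-2}K_{uu}^{-1}\Big[\sigma^2 K_{uu} + K_{uf}K_{fu} - K_{uf}K_{fu}\Big]\Big(\sigma^2 K_{uu} + K_{uf}K_{fu}\Big)^{-1}K_{uf} \\ &= \Big(K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1}K_{uf} \end{aligned} $$

which is exactly the matrix $B$ of eq. (3).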

Proof

The ad hoc way to prove the result is to show that the matrices $A$ from (2) and $B$ from (3) are equal. We can simplify the expressions:

$$ A = K_{uu}^{-1}K_{uf}\Big(Q_{ff}+ \sigma^2 I \Big)^{-1} = K_{uu}^{-1}K_{uf}\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)^{-1} $$$$ B = \Big( K_{uf}K_{fu} + \sigma^2 K_{uu}\Big)^{-1} K_{uf} = \Big[ \Big( K_{uf}K_{fu}K_{uu}^{-1} + \sigma^2 I \Big) K_{uu}\Big]^{-1} K_{uf} = K_{uu}^{-1}\Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)^{-1} K_{uf} $$

Thus $A = B$ leads us to:

\begin{equation} \require{cancel} \cancel{K_{uu}^{-1}} K_{uf}\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)^{-1} = \cancel{K_{uu}^{-1}} \Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)^{-1} K_{uf} \end{equation}

Multiplying each side of equation (4) on the left by $\Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big)$ and on the right by $\Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big)$, we have:

$$ \Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big) K_{uf} = K_{uf} \Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big) $$

This is true since

$$ \Big( K_{uf}K_{fu} K_{uu}^{-1} + \sigma^2 I \Big) K_{uf} = K_{uf}K_{fu} K_{uu}^{-1}K_{uf} + \sigma^2 K_{uf} = K_{uf} \Big(K_{fu}K_{uu}^{-1}K_{uf}+ \sigma^2 I \Big) $$

(This trick is taken from http://stat.wikia.com/wiki/Kernel_Ridge_Regression).
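
As a quick numerical sanity check, reusing the arrays from the sketches above, the two weight matrices and the two predictions should agree up to floating-point error:

# Numerical sanity check that A (Nystrom) and B (RBFN) coincide
print(np.allclose(A, B))                          # expect True (up to round-off)
print(np.allclose(y_star_nystrom, y_star_rbfn))   # expect True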

Bayesian approach

Nystrom

The Nystrom approach, in the context of $\mathcal{GP}$s, changes the prior over $(y,y^*)$ to $p(y,y^* \mid X_*, X)=\mathcal{N}\left(\begin{pmatrix}y \\ y^* \end{pmatrix} \Big| \; 0,\:\begin{pmatrix} Q_{f,f}+\sigma^2 I & Q_{f,*}\\ Q_{*,f} & Q_{*,*} + \sigma^2 I \end{pmatrix} \right)$. We want to retrieve the posterior predictive distribution $p(y^* \mid X_*, X, y)$ from this joint prior. If we dust off Christopher Bishop's Pattern Recognition and Machine Learning book and go to Chapter 2, Section 2.3.1 on conditional Gaussian distributions, we find the derivation that gives the conditional distribution of a joint Gaussian distribution.

Applying this formula (which can also be found on Wikipedia) leads to the expression of the posterior predictive distribution:

\begin{equation} p(y^* \mid X_*, X, y) = \mathcal{N}\big(y^* \mid Q_{*,f}(Q_{f,f}+\sigma^2 I)^{-1}y,\enspace Q_{*,*} + \sigma^2 I - Q_{*,f}(Q_{f,f} + \sigma^2 I)^{-1}Q_{f,*} \big) \end{equation}
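
This predictive distribution can also be sketched numerically, again reusing the variables defined above; the mean should match the KRR-style prediction from the Nystrom section:

# Posterior predictive of the Nystrom GP, eq. (5)
Q_sf = K_su @ np.linalg.solve(K_uu, K_uf)            # Q_{*,f}
Q_ss = K_su @ np.linalg.solve(K_uu, K_su.T)          # Q_{*,*}
G = np.linalg.inv(Q_ff + sigma2 * np.eye(N))

pred_mean = Q_sf @ G @ y
pred_cov = Q_ss + sigma2 * np.eye(len(X_star)) - Q_sf @ G @ Q_sf.T

print(np.allclose(pred_mean, y_star_nystrom))        # expect True: same mean as eq. (2)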

Bayesian RBFN

The RBFN method can also be viewed from a standard Bayesian linear regression point of view. We assume the model $y_i = K_{x_iu}\alpha + \epsilon_i$ where $\epsilon_i\sim \mathcal{N}(0,\sigma^2)$, which is equivalent to $p(y_i\mid x_i,\alpha) = p(y_i \mid K_{x_iu},\alpha)=\mathcal{N}(y_i\mid K_{x_iu}\alpha, \sigma^2)$. The likelihood of our model is then:

\begin{equation} L(\alpha) = p(\overbrace{y_1,...,y_N}^y \mid \overbrace{x_1,...,x_N}^X,\alpha) = \prod_{i=1}^N p(y_i\mid x_i,\alpha) = \prod_{i=1}^N \mathcal{N}(y_i \mid K_{x_iu}\alpha , \sigma^2) = \mathcal{N}(y \mid K_{fu}\alpha, \sigma^2 I ) = \frac{1}{(2\pi)^{N/2}\sqrt{|\sigma^2I|}}exp\big(-\frac{1}{2}(y- K_{fu}\alpha)^t \sigma^{-2}I(y- K_{fu}\alpha)\big) = \frac{1}{(2\pi \sigma^2)^{N/2}}exp\big(-\frac{1}{2\sigma^2}\|y- K_{fu}\alpha\|^2\big) \end{equation}

If we place a prior $\alpha\sim \mathcal{N}(0,A)$ over the weights, the joint distribution of $y$ and $\alpha$ given $X$ is:

$$ p(y,\alpha \mid X) = p(y\mid X,\alpha)p(\alpha \mid X) = L(\alpha)\mathcal{N}(\alpha\mid 0,A) $$

At this point it is worth stopping to consider what we want: the natural answer is to give predictions for new incoming $x$ values. In Bayesian language this means that we want a posterior predictive distribution, i.e. a probability distribution over the $y^*$ values given all the available information: $p(y^* \mid X_*, X, y)$.

One way to obtain the posterior predictive distribution, which is similar to the approach followed for $\mathcal{GP}$s, is:

  1. Augment the above equation with the test points: $p(y,y^*,\alpha \mid X,X_*)$ (it will just change the likelihood term).
  2. Integrate out $\alpha$: $ p(y,y^* \mid X,X_*) = \int p(y,y^*,\alpha \mid X,X_*) d\alpha $
  3. Compute the posterior predictive using the trick we used above in the Nystrom case (5).

The other, more standard approach would be:

  1. Compute the posterior for $\alpha$: $p(\alpha \mid y,K_{fu}) = p(y,\alpha\mid K_{fu})/p(y\mid K_{fu})$ (this can be done in closed form since all the formulas involve Gaussian distributions).
  2. Compute the posterior predictive distribution integrating out $\alpha$:
$$ p(y^* \mid x^*, X, y) = \int p(y^* \mid x^*, \alpha, \cancel{X, y}) \, p(\alpha \mid \cancel{x^*}, X, y) \, d \alpha = \int \mathcal{N}(y^*\mid K_{*u}\alpha, \sigma^2) \, p(\alpha \mid y,K_{fu}) \, d \alpha $$

We will further develop this approach in the fully Bayesian approach section. However, first we will consider the maximum a posteriori (MAP) approach, which we'll see is equivalent to the empirical risk minimization approach.

MAP approach

The "less bayesian approach" would be to compute the maximum a posteriori estimate of $p(\alpha \mid y,K_{fu})$ and then use such value ($\alpha_{MAP}$) to compute predictions: $y^* = K_{*,u}\alpha_{MAP}$. This is the approach which is equivalent to the empirical risk minimization approach exposed above (if we use $A=K_{uu}^{-1}$):

$$ \begin{aligned} \alpha_{MAP} = \arg \max_{\alpha} p(\alpha \mid y,K_{fu}) &= \arg \max_{\alpha} \frac{p(y,\alpha \mid K_{fu})}{p(y\mid K_{fu})} = \arg \max_{\alpha} p(y,\alpha \mid K_{fu}) = \arg \max_{\alpha} \big[\log p(y,\alpha \mid K_{fu})\big] = \\ &= \arg \min_{\alpha}\Big[ -\log L(\alpha) - \log \mathcal{N}(\alpha \mid 0,A) \Big] = \\ &= \arg \min_{\alpha} \Big[ \frac{1}{\cancel{2}\sigma^2} \|y- K_{fu}\alpha\|^2 + \cancel{\frac{N}{2}\log(2\pi \sigma^2)} + \cancel{\frac{1}{2}}\alpha^t A^{-1} \alpha + \cancel{\frac{M}{2}\log(2\pi)}+ \cancel{\frac{1}{2}\log(|A|)} \Big] = \arg \min_{\alpha} J(\alpha) \end{aligned} $$
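
As a short numerical check, we can use the standard Bayesian linear regression posterior for $\alpha$ (see e.g. Bishop, Section 3.3; it is not derived in this post), take $A=K_{uu}^{-1}$, and verify that its mode coincides with the ridge solution $\alpha_{opt}$ computed earlier:

# With A = K_uu^{-1}, the Gaussian posterior of alpha has
#   Sigma_post = (A^{-1} + sigma^{-2} K_uf K_fu)^{-1},  mu_post = sigma^{-2} Sigma_post K_uf y,
# and since the posterior is Gaussian its mode (the MAP) equals its mean.
Sigma_post = np.linalg.inv(K_uu + K_uf @ K_fu / sigma2)
alpha_map = Sigma_post @ K_uf @ y / sigma2

print(np.allclose(alpha_map, alpha_opt))             # expect True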

Fully Bayesian approach

We will develop here the "similar to $\mathcal{GP}$s" approach to derive the posterior predictive distribution $p(y^*\mid X_*,X,y)$. First we consider the augmented likelihood:

\begin{equation} p(y,y^* \mid X, X_*, \alpha) = \mathcal{N}\left(\begin{pmatrix} y \\ y^* \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \alpha, \sigma^2 I \right) \end{equation}

This is the same expression as (6) but including the test data $(X_*,y^*)$.

Now we integrate out $\alpha$: $$ p(y,y^* \mid X,X_*) = \int p(y,y^*, \alpha \mid X,X_*) d\alpha = \int p(y,y^* \mid X,X_*,\alpha)p(\alpha \mid X,X_*) d\alpha = \int \mathcal{N}\left(\begin{pmatrix} y \\ y^* \end{pmatrix} \Big | \; \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \alpha, \sigma^2 I \right) \mathcal{N}(\alpha \mid 0, A) d\alpha $$

To solve this (nasty) integral we can rely again on Bishop's book, this time Section 2.3.3, Bayes' theorem for Gaussian variables. However, if we trust that the convolution of two Gaussians is Gaussian, we can simply find the mean and covariance of $(y,y^*) \mid X, X_*$.

First we compute the mean:

$$ \mathbb{E}[(y,y^*) \mid X, X_*] = \mathbb{E}[\mathbb{E}[y,y^* \mid X, X_*,\alpha]] \underbrace{=}_{eq. 7} \mathbb{E}\left[\begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \alpha + \epsilon \right] = \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \mathbb{E}[\alpha] + 0 \underbrace{=}_{\alpha \text{ has mean zero}} 0 $$

Then we compute the covariance:

$$ \mathbb{E}\left[\begin{pmatrix} y \\ y^* \end{pmatrix} \begin{pmatrix}y^t & y^{*t} \end{pmatrix} \Big | \; X, X_*\right] = \mathbb{E}\left[\mathbb{E}\left[ \begin{pmatrix} y \\ y^* \end{pmatrix} \begin{pmatrix}y^t & y^{*t} \end{pmatrix} \Big | \; X, X_*,\alpha \right]\right] = \mathbb{E}\left[ \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \alpha \alpha^t \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} + \epsilon \epsilon^t \right] = \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} \mathbb{E}\left[\alpha \alpha^t\right] \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} + \sigma^2 I \\ = \begin{pmatrix} K_{fu} \\ K_{*u} \end{pmatrix} A \begin{pmatrix}K_{uf} & K_{u*} \end{pmatrix} + \sigma^2 I = \begin{pmatrix} K_{fu}AK_{uf} & K_{fu}AK_{u*} \\ K_{*u}AK_{uf} & K_{*u}AK_{u*} \end{pmatrix} + \sigma^2 I $$

We recognize here again that, if $A=K_{uu}^{-1}$, we recover the same joint distribution over the training and test data as in the Nystrom prior:

$$ p(y,y^* \mid X,X_*) = \mathcal{N}\left(\begin{pmatrix}y \\ y^* \end{pmatrix} \Big| \; 0,\:\begin{pmatrix} K_{fu}AK_{uf} + \sigma^2 I & K_{fu}AK_{u*} \\ K_{*u}AK_{uf} & K_{*u}AK_{u*} + \sigma^2 I \end{pmatrix} \right) $$

Using the same procedure as in equation (5), and assuming $A=K_{uu}^{-1}$, we retrieve the same posterior predictive distribution as in the Nystrom method.
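
Finally, a numerical sketch of this equivalence: the weight-space (SoR) predictive obtained from the posterior over $\alpha$ should match the function-space predictive of eq. (5) computed in the Nystrom section above (all variables reused from the earlier snippets):

# Weight-space (SoR) predictive vs. function-space Nystrom predictive, eq. (5)
sor_mean = K_su @ alpha_map
sor_cov = K_su @ Sigma_post @ K_su.T + sigma2 * np.eye(len(X_star))

print(np.allclose(sor_mean, pred_mean))              # expect True
print(np.allclose(sor_cov, pred_cov))                # expect True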